Interactive InfoVis Using User-Centric Design¶
A Design Study Approach Using Kaggle Dataset¶
- Author: Rajesh Kutti
- Date: April 10, 2024
- University of Colorado Boulder - DTSA5304
Summary¶
- The goal of this project is to do Exploratory Data Analysis (EDA) and Information Visualization (InfoVis) to gain insights from the Credit Card transactions dataset from Kaggle
- The EDA on this dataset will help us identify patterns for finding fraud in credit card transactions
- Design Study Approach: a project that analyzes a real-world problem in order to design a validated visualization system, where the designers reflect on lessons learned
Required Libraries¶
- pip install altair vega_datasets
- pip install geopandas --- https://geopandas.org
- pip install folium --- https://pypi.org/project/folium
- pip install mapclassify --- https://pypi.org/project/mapclassify
- pip install seaborn
Kaggle Dataset¶
- Credit Card Fraud Prediction
- Please sign up to Kaggle to download the data.
GEODATAFRAMES¶
- GeoPandas GeoDataFrames are like pandas DataFrames with an added geometry column. The geometry column provides the points, polygons, etc. for the map.
- Geopandas
- We will load the US census data from https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
- Shapefile for States : https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip
Other Libs¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import geopandas as gpd
import seaborn as sns
import folium
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib import pyplot
import warnings
warnings.filterwarnings("ignore")
Functions to load and clean datasets¶
In [2]:
# This function will return a cleaned-up dataframe:
# Clean the original dataset
# Add a few important columns
# Prepare the data for the next round of experiments
# Save the data for later use
# We will be creating a few features that will help us later
def loadCleanAndSave():
    temp_csv_df0 = pd.read_csv("fraud_test.csv")
    temp_csv_df1 = temp_csv_df0.drop('Unnamed: 0', axis=1)
    temp_csv_df1['cc_num'] = temp_csv_df1['cc_num'].astype('str')
    temp_csv_df1['tran_date_time'] = pd.to_datetime(temp_csv_df1['trans_date_trans_time'],
                                                    format="%d/%m/%Y %H:%M", errors='coerce')
    # Truncate the timestamp to midnight to get the transaction date
    temp_csv_df1['tran_date'] = temp_csv_df1['tran_date_time'].dt.normalize()
    temp_csv_df1['tran_month'] = temp_csv_df1['tran_date'].dt.strftime('%m')
    temp_csv_df1['tran_month_nm'] = temp_csv_df1['tran_date'].dt.strftime('%b')
    temp_csv_df1['dob2'] = pd.to_datetime(temp_csv_df1['dob'], format="%d/%m/%Y", errors='coerce')
    temp_csv_df1['age'] = 2024 - temp_csv_df1['dob2'].dt.year
    #temp_csv_df1.index = temp_csv_df1['tran_date']
    # Add more features
    cols = ['tran_date', 'cc_num', 'amt']
    temp_csv_df2 = temp_csv_df1[cols].copy()
    # Previous amount within the same card and date (0 when there is none)
    group_by_cols = ['tran_date', 'cc_num']
    temp_csv_df2['amt_prev'] = temp_csv_df2.groupby(group_by_cols)['amt'].shift(1).fillna(0)
    # Now we need to group by card number to get the previous date
    group_by_cols = ['cc_num']
    temp_csv_df2['tran_date_prev'] = temp_csv_df2.groupby(group_by_cols)['tran_date'].shift(1)
    temp_csv_df2 = temp_csv_df2.reset_index()
    # Days since the previous transaction on the same card
    temp_csv_df2['lst_day_diff'] = ((temp_csv_df2['tran_date'] -
                                     temp_csv_df2['tran_date_prev']).dt.days).fillna(0)
    # Percentage change in amount versus the previous transaction
    temp_csv_df2['lst_amt_pct_chng'] = ((temp_csv_df2['amt'] - temp_csv_df2['amt_prev']) /
                                        temp_csv_df2['amt_prev']).replace([np.inf, -np.inf], 0).round(2)
    temp_csv_df2_cols = ['tran_date', 'cc_num', 'amt', 'lst_day_diff', 'lst_amt_pct_chng']
    temp_csv_df2 = temp_csv_df2[temp_csv_df2_cols]
    temp_csv_df3 = pd.merge(temp_csv_df1, temp_csv_df2, how="left", on=['tran_date', 'cc_num', 'amt'])
    cols = ['tran_date', 'tran_date_time', 'tran_month', 'tran_month_nm', 'cc_num', 'merchant',
            'category', 'amt', 'gender', 'city', 'state', 'zip', 'lat',
            'long', 'city_pop', 'age', 'trans_num', 'unix_time', 'merch_lat', 'merch_long',
            'is_fraud', 'lst_day_diff', 'lst_amt_pct_chng']
    result = temp_csv_df3[cols]
    result.to_csv("fraud_expmt_v1.csv", index=False, header=True)
    return result

def loadCensusDataset():
    zipfile = "cb_2018_us_state_500k.zip"
    result = gpd.read_file(zipfile)
    print(f'type:{type(result)}')
    return result
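A toy example makes the shift-based features above concrete. This is a sketch on a made-up three-transaction card history (the card number, dates, and amounts are invented, not from the Kaggle data):

```python
import pandas as pd
import numpy as np

# Hypothetical history for one card (made-up values)
toy = pd.DataFrame({
    'cc_num': ['c1', 'c1', 'c1'],
    'tran_date': pd.to_datetime(['2020-06-01', '2020-06-03', '2020-06-10']),
    'amt': [10.0, 40.0, 20.0],
})
# Previous amount on the same card (0 when there is no prior transaction)
toy['amt_prev'] = toy.groupby('cc_num')['amt'].shift(1).fillna(0)
# Days since the previous transaction on the same card
toy['lst_day_diff'] = toy.groupby('cc_num')['tran_date'].diff().dt.days.fillna(0)
# Percentage change vs. the previous amount (inf when amt_prev is 0, mapped to 0)
toy['lst_amt_pct_chng'] = ((toy['amt'] - toy['amt_prev']) / toy['amt_prev']
                           ).replace([np.inf, -np.inf], 0).round(2)
print(toy[['amt_prev', 'lst_day_diff', 'lst_amt_pct_chng']].values.tolist())
```

The second row shows an amount jump of 3.0 (300%) after a 2-day gap; spikes like these are the signal the later charts look for.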
In [3]:
# Load cleaned up dataset
fraud_expmt_df0 = loadCleanAndSave()
fraud_expmt_df0 = pd.read_csv("fraud_expmt_v1.csv")
fraud_expmt_type_f_df0 = fraud_expmt_df0[fraud_expmt_df0.is_fraud == 1]
fraud_expmt_type_nf_df0 = fraud_expmt_df0[fraud_expmt_df0.is_fraud == 0]
col_names = " ".join(fraud_expmt_df0.columns.values)
print(f"columns: {col_names}")
fraud_expmt_df0.head(2)
columns: tran_date tran_date_time tran_month tran_month_nm cc_num merchant category amt gender city state zip lat long city_pop age trans_num unix_time merch_lat merch_long is_fraud lst_day_diff lst_amt_pct_chng
Out[3]:
| tran_date | tran_date_time | tran_month | tran_month_nm | cc_num | merchant | category | amt | gender | city | ... | long | city_pop | age | trans_num | unix_time | merch_lat | merch_long | is_fraud | lst_day_diff | lst_amt_pct_chng | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-06-21 | 2020-06-21 12:14:00 | 6 | Jun | 2.291160e+15 | fraud_Kirlin and Sons | personal_care | 2.86 | M | Columbia | ... | -80.9355 | 333497 | 56 | 2da90c7d74bd46a0caf3777415b3ebd3 | 1371816865 | 33.986391 | -81.200714 | 0 | 0.0 | 0.0 |
| 1 | 2020-06-21 | 2020-06-21 12:14:00 | 6 | Jun | 3.573030e+15 | fraud_Sporer-Keebler | personal_care | 29.84 | F | Altonah | ... | -110.4360 | 302 | 34 | 324cc204407e99f51b0d6ca0055005e7 | 1371816873 | 39.450498 | -109.960431 | 0 | 0.0 | 0.0 |
2 rows × 23 columns
Load the census dataset¶
In [4]:
gp_states_df = loadCensusDataset()
type:<class 'geopandas.geodataframe.GeoDataFrame'>
In [5]:
# Dataset aggregated by multiple columns
group_by_cols = ['tran_month', 'tran_month_nm', 'state', 'category', 'is_fraud']
# agg_cols = {'amt': ['sum', 'count']} would put both aggregates under one hierarchical column
agg_cols = {'amt': 'sum', 'tran_month': 'count'}  # aggregates each column separately
rename_cols = {'amt': 'fnf_amt', 'tran_month': 'fnf_ct'}
fraud_expmt_df1 = fraud_expmt_df0.groupby(
    group_by_cols).agg(agg_cols).rename(columns=rename_cols).reset_index()
#fraud_expmt_df1.head(2)
f_df = fraud_expmt_df1[fraud_expmt_df1.is_fraud == 1]
f_df.head()
nf_df = fraud_expmt_df1[fraud_expmt_df1.is_fraud == 0]
#nf_df.head()
In [6]:
# Pivot data for a different view by states
# Group the data by transaction-month, state, category and sum fnf_amt and fnf_ct
fraud_expmt_df2 = pd.pivot_table(fraud_expmt_df1,
                                 index=['tran_month', 'tran_month_nm', 'state', 'category'],
                                 columns=['is_fraud'],
                                 aggfunc='sum', fill_value=0).add_prefix('ftype').reset_index()
# Create tabular data by removing the hierarchical columns
fraud_expmt_df3 = fraud_expmt_df2.set_axis([f"{x}{y}" for x, y in fraud_expmt_df2.columns], axis=1)
col_renames = {'ftypefnf_amtftype1': 'f_amt', 'ftypefnf_amtftype0': 'nf_amt',
               'ftypefnf_ctftype1': 'f_ct', 'ftypefnf_ctftype0': 'nf_ct'}
# Rename columns so fraud and non-fraud values sit in one row for each category
fraud_expmt_df3 = fraud_expmt_df3.rename(columns=col_renames)
fraud_expmt_df3.head(2)
# Prepare data for joint plots
fraud_expmt_type_f_df0.head(5)
jplot_cols = ['age', 'amt', 'category']
jplot_df = fraud_expmt_type_f_df0[jplot_cols]
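The pivot-then-flatten pattern used above (and again later for the geo data) can be sketched on a made-up mini ledger; the `mini` frame and all its values are invented for illustration:

```python
import pandas as pd

# Invented mini ledger: amounts and counts per state and fraud flag
mini = pd.DataFrame({
    'state': ['TX', 'TX', 'NY', 'NY'],
    'is_fraud': [0, 1, 0, 1],
    'fnf_amt': [100.0, 5.0, 200.0, 7.0],
    'fnf_ct': [3, 1, 4, 2],
})
# Pivot: one row per state, one column per (measure, is_fraud) pair
wide = pd.pivot_table(mini, index=['state'], columns=['is_fraud'],
                      aggfunc='sum', fill_value=0).reset_index()
# The columns are now a two-level MultiIndex, e.g. ('fnf_amt', 1);
# joining the levels gives flat names, as the notebook does with set_axis
flat = wide.set_axis([f"{x}{y}" for x, y in wide.columns], axis=1)
print(sorted(flat.columns))  # ['fnf_amt0', 'fnf_amt1', 'fnf_ct0', 'fnf_ct1', 'state']
```

Flattening the hierarchy this way is what lets later cells refer to fraud and non-fraud measures as ordinary columns in one row.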
Prepare Merchant data¶
In [7]:
# Aggregate the fraud dataset by merchant
group_by_cols = ['merchant']
agg_cols = {'amt': 'sum', 'tran_month': 'count'}  # aggregates each column separately
rename_cols = {'amt': 'f_amt', 'tran_month': 'f_ct'}
merch_df = fraud_expmt_type_f_df0.groupby(
    group_by_cols).agg(agg_cols).rename(columns=rename_cols).reset_index()
# Get the Top 25 merchants based on fraud amounts
merchTopX = merch_df.sort_values('f_amt', ascending=False).head(25)
# Merge with the fraud dataset to get all observations for the Top 25 merchants
merch_df2 = pd.merge(fraud_expmt_type_f_df0, merchTopX, how="right", on="merchant")
merch_df2.columns.values
Out[7]:
array(['tran_date', 'tran_date_time', 'tran_month', 'tran_month_nm',
'cc_num', 'merchant', 'category', 'amt', 'gender', 'city', 'state',
'zip', 'lat', 'long', 'city_pop', 'age', 'trans_num', 'unix_time',
'merch_lat', 'merch_long', 'is_fraud', 'lst_day_diff',
'lst_amt_pct_chng', 'f_amt', 'f_ct'], dtype=object)
Prepare Data for Geospatial Analysis¶
In [8]:
# Group by state and category
# Dataset aggregated by multiple columns
group_by_cols = ['state', 'category']
agg_cols = {'nf_amt': 'sum', 'f_amt': 'sum', 'nf_ct': 'sum', 'f_ct': 'sum'}  # aggregates each column separately
fnf_geo_by_state_df1 = fraud_expmt_df3.groupby(group_by_cols).agg(agg_cols).reset_index()
# State-level totals
group_by_cols = ['state']
fnf_geo_by_state_df2 = fraud_expmt_df3.groupby(group_by_cols).agg(agg_cols).reset_index()
# Pivot each observation to wide format: one column per category
index_cols = ['state']
fnf_geo_by_state_df3 = pd.pivot_table(fnf_geo_by_state_df1, index=index_cols,
                                      columns=['category'],
                                      aggfunc='sum', fill_value=0).add_prefix('ct_').reset_index()
# Flatten the hierarchical columns into plain names
fnf_geo_by_state_df4 = fnf_geo_by_state_df3.set_axis([f"{x}{y}" for x, y in fnf_geo_by_state_df3.columns], axis=1)
fnf_geo_by_state_df4['amount'] = fnf_geo_by_state_df4.iloc[:, 1:].sum(axis=1)
# merge the aggregated counts
fnf_geo_by_state_df5 = pd.merge(fnf_geo_by_state_df2, fnf_geo_by_state_df4, how="left", on="state")
# add tooltip
fnf_geo_by_state_df5['toolt'] = fnf_geo_by_state_df5['state'] + \
    '\n Fraud count:' + fnf_geo_by_state_df5['f_ct'].astype(str) + \
    '\n Fraud amount:' + fnf_geo_by_state_df5['f_amt'].astype(str) + \
    '\n Non Fraud count:' + fnf_geo_by_state_df5['nf_ct'].astype(str) + \
    '\n Non Fraud amount:' + fnf_geo_by_state_df5['nf_amt'].astype(str) + \
    '\n Shopping net amount:' + fnf_geo_by_state_df5['ct_f_amtct_shopping_net'].astype(str) + \
    '\n Travel amount:' + fnf_geo_by_state_df5['ct_f_amtct_travel'].astype(str)
select_cols = ['state', 'f_amt', 'f_ct', 'nf_amt', 'nf_ct', 'ct_f_amtct_shopping_net', 'ct_f_amtct_travel']
fnf_geo_by_state_df6 = fnf_geo_by_state_df5[select_cols]
ren_cols = {'f_amt':'fraud amount', 'f_ct':'fraud count',
'nf_amt':'non-fraud amount', 'nf_ct':'non-fraud count',
'ct_f_amtct_shopping_net' : 'shopping_net',
'ct_f_amtct_travel' : 'travel',
}
fnf_geo_by_state_df6 = fnf_geo_by_state_df6.rename(columns=ren_cols)
In [9]:
gp_states_df1 = gp_states_df[['NAME', 'STATEFP', 'STUSPS', 'geometry']]
gp_states_df2 = pd.merge(gp_states_df1, fnf_geo_by_state_df6, how="right", right_on="state", left_on="STUSPS")
Quick Insights about our dataset¶
In [10]:
records = fraud_expmt_df0.shape[0]
fnf_counts_df = fraud_expmt_df0.groupby('is_fraud')['is_fraud'].agg(['count']).reset_index()
fnf_counts_df['pct'] = round((fnf_counts_df['count'] / records) * 100,2)
fraud_ct = fnf_counts_df.loc[fnf_counts_df.is_fraud == 1, 'count'].iloc[0]
t_records = f'records: {records:,.0f}'
t_fraud_ct = f'fraud count: {fraud_ct:,.0f}'
atitle = alt.TitleParams(text='Fraud and NonFraud Counts',
                         subtitle=[t_records, t_fraud_ct],
                         anchor='middle')
base = alt.Chart(fnf_counts_df).mark_arc().encode(
    theta="count",
    color=alt.Color('is_fraud', legend=None)
).properties(
    title=atitle,
    height=100,
    width=300,
)
pie = base.mark_arc(outerRadius=80)
text = base.mark_text(radius=20, size=20).encode(text="pct:N")
pie + text
Out[10]:
Timeseries Analysis¶
How is the fraud transaction amount distributed throughout the year?¶
- While this chart gives a good indication of fraud amounts across the year, it does not tell us about the trends
- Are the trends on the rise?
In [11]:
daily_fraud_df = fraud_expmt_type_f_df0.groupby('tran_date')['amt'].agg(['sum']).rename(columns={'sum': 'amount'}).reset_index()
alt.Chart(daily_fraud_df).mark_line(tooltip={'content': 'data'}).encode(
    x='tran_date:T',
    y='amount:Q'
).properties(
    title='Fraud losses (Amount) by Transaction date',
    height=150,
    width=700
)
Out[11]:
Is there an increase or decrease in the trend?¶
- The change in fraud from day to day gives us a clearer picture
- We also learned that there were outliers in the data for June and July
In [12]:
daily_fraud_df['amount_lag1'] = daily_fraud_df['amount'].shift(1)
daily_fraud_df.fillna({'amount_lag1': 0}, inplace=True)
daily_fraud_df['amt_change_pct'] = ((daily_fraud_df['amount'] - daily_fraud_df['amount_lag1']) /
                                    daily_fraud_df['amount_lag1']).replace([np.inf, -np.inf], 0)
alt.Chart(daily_fraud_df).mark_line(tooltip={'content': 'data'}).encode(
    x='tran_date:T',
    y='amt_change_pct:Q'
).properties(
    title='Percentage change in Fraud losses (Amount) by Transaction date',
    height=150,
    width=700
)
Out[12]:
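The lag-then-divide computation above is the same quantity pandas exposes as `Series.pct_change()`; a quick sketch on invented daily totals:

```python
import pandas as pd

# Invented daily fraud totals
amounts = pd.Series([100.0, 150.0, 120.0])
# Manual version: lag by one row, then relative change
manual = ((amounts - amounts.shift(1)) / amounts.shift(1)).fillna(0)
# Built-in version gives the same values (its first element is NaN)
builtin = amounts.pct_change().fillna(0)
print(manual.round(2).tolist())  # [0.0, 0.5, -0.2]
```

Either form works; the manual version just makes the lag column available for inspection alongside the result.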
Univariate: How is the amount distributed across the categories?¶
In [13]:
# Simple chart
custom_dims = (5, 3)
fig, ax = pyplot.subplots(figsize=custom_dims)
sns.histplot(ax=ax, data=fraud_expmt_type_f_df0, x="amt", log_scale=False, fill=False)
ax.set_title("Simple Amount distribution for Fraud transactions")
Out[13]:
Text(0.5, 1.0, 'Simple Amount distribution for Fraud transactions')
In [14]:
custom_dims = (5, 4)
fig, ax = pyplot.subplots(figsize=custom_dims)
sns.histplot(ax=ax, data=fraud_expmt_type_f_df0, x="amt", bins=20, hue="category", element="step", multiple="dodge")
sns.move_legend(ax, "upper right", bbox_to_anchor=(1.5, 1), title='Categories')
ax.set_title("Intuitive approach - Amount distribution for Fraud transactions")
Out[14]:
Text(0.5, 1.0, 'Intuitive approach - Amount distribution for Fraud transactions')
In [15]:
# Distribution of the lst_amt_pct_chng column: percentage change in amount from the previous transaction
custom_dims = (5, 4)
df1 = fraud_expmt_type_f_df0[fraud_expmt_type_f_df0.lst_amt_pct_chng < 150]
fig, ax = pyplot.subplots(figsize=custom_dims)
sns.histplot(ax=ax, data=df1, x="lst_amt_pct_chng", bins=20,
hue="category", element="poly", multiple="dodge")
sns.move_legend(ax, "upper right", bbox_to_anchor=(1.5, 1), title='Categories')
ax.set_title("Intuitive approach - Daily pct change in amount for Fraud transactions")
Out[15]:
Text(0.5, 1.0, 'Intuitive approach - Daily pct change in amount for Fraud transactions')
Bivariate analysis - What are the correlations?¶
In [16]:
### Correlations between the variables in the fraud dataset
In [17]:
df_cols = ["amt", "lst_amt_pct_chng", "age","lst_day_diff" ]
df = fraud_expmt_df0[df_cols]
cor_data = df.corr().stack().reset_index().rename(columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'})
cor_data['correlation_label'] = cor_data['correlation'].map('{:.2f}'.format) # Round to 2 decimal
cor_data.head()
Out[17]:
| variable | variable2 | correlation | correlation_label | |
|---|---|---|---|---|
| 0 | amt | amt | 1.000000 | 1.00 |
| 1 | amt | lst_amt_pct_chng | 0.402820 | 0.40 |
| 2 | amt | age | -0.012861 | -0.01 |
| 3 | amt | lst_day_diff | 0.012159 | 0.01 |
| 4 | lst_amt_pct_chng | amt | 0.402820 | 0.40 |
In [18]:
base = alt.Chart(cor_data).encode(
    x='variable2:O',
    y='variable:O'
)
# Text layer with correlation labels
# Colors are for easier readability
text = base.mark_text().encode(
    text='correlation_label',
    color=alt.condition(
        alt.datum.correlation > 0.5,
        alt.value('white'),
        alt.value('black')
    )
)
# The correlation heatmap itself
cor_plot = base.mark_rect().encode(
    color='correlation:Q'
).properties(
    height=250,
    width=250,
    title='Correlation matrix'
)
cor_plot + text # The '+' means overlaying the text and rect layer
Out[18]:
In [19]:
# Take a random sample (capped at 5,000 rows) for the scatter-plot matrix
sample_df = fraud_expmt_df0.sample(frac=0.009)
sample_df = sample_df[:5000]
alt.Chart(sample_df).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color="category",
).properties(
    title='Correlation chart for multiple variables',
    width=200,
    height=200
).repeat(
    row=["amt", "lst_amt_pct_chng"],
    column=["age", "amt", "lst_day_diff"]
)
Out[19]:
Descriptive Statistics - Joint plots¶
- The joint plot will show the age and amount distribution
- The category helps us to understand how the data is distributed by category
- The shopping_net and misc_net categories have more transactions at higher amounts
In [20]:
ax = sns.jointplot(data=jplot_df,
                   x='age', y='amt', hue="category",
                   kind="hist", marginal_ticks=True, palette="Set2")
ax.fig.set_figheight(4)
ax.fig.set_figwidth(7)
ax.fig.suptitle("Fraud transactions - joint-plot by age, amount, category",
                fontsize=11, fontdict={"weight": "bold"})
ax.figure.subplots_adjust(top=0.9);
#ax.ax_joint.legend(loc='upper right')
sns.move_legend(ax.ax_joint, "upper right", bbox_to_anchor=(1.52, 1.3), title='Categories')
Violin plot by Categories¶
- The health and fitness categories had the highest counts
In [21]:
ax1a = sns.violinplot(data=f_df, x="category", y="fnf_amt",
scale='count', inner="quart", height=4, aspect=1.25)
ax1a.set_xticklabels(ax1a.get_xticklabels(), rotation=90);
ax1a.set_title('Fraud Distributions by amount and category for the year')
ax1a.figure.set_size_inches(5,3)
In [22]:
ax2a = sns.violinplot(data=nf_df, x="category", y="fnf_amt", scale="count", inner="quart")
ax2a.set_xticklabels(ax2a.get_xticklabels(), rotation=90);
ax2a.set_title('Non Fraud Distributions by amount and category for the year')
ax2a.figure.set_size_inches(5,3)
Descriptive stats to show confidence intervals for amounts¶
- use the ci0 and ci1 aggregates to plot the confidence interval of the estimate of the mean amounts
- use color to distinguish male and female
In [23]:
alt.Chart(fraud_expmt_type_f_df0).mark_area(opacity=0.3, tooltip={'content': 'data'}).encode(
    x=alt.X('tran_date', timeUnit='month'),
    y=alt.Y('ci0(amt)', axis=alt.Axis(title='Fraud Amount')),
    y2='ci1(amt)',
    color='gender'
).properties(
    title='Fraud amounts by gender using a 95% confidence interval band',
    width=300,
    height=200
)
Out[23]:
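Altair's `ci0`/`ci1` aggregates compute a bootstrapped 95% confidence interval of the mean; the idea behind the band can be sketched directly with numpy on an invented sample:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented stand-in for one month of fraud amounts
sample = rng.normal(loc=50, scale=10, size=200)
# Bootstrap the mean: resample with replacement many times
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(2000)])
# The 2.5th and 97.5th percentiles bound the band Altair draws
ci0, ci1 = np.percentile(boot_means, [2.5, 97.5])
print(ci0 < sample.mean() < ci1)
```

The band therefore reflects uncertainty in the *mean* monthly amount, not the spread of individual transactions.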
Intuitive approach to show correlations in an interactive way¶
- By selecting a category, we can get information about the amount and age bins of the fraud transactions
- The color of the rectangular blocks indicates the number of records for each intersection of age and amount bins
In [24]:
# Let's implement filtering using dynamic queries.
selection = alt.selection(type="multi", fields=["category"])
# Create a container for our two different views
base = alt.Chart(fraud_expmt_type_f_df0).properties(width=500, height=250)
# Overview chart: median fraud amount per category, with category selection
overview = alt.Chart(fraud_expmt_type_f_df0).mark_bar(tooltip={'content': 'data'}).encode(
    y="median(amt)",
    x="category",
    color=alt.condition(selection, alt.value("gold"), alt.value("lightgrey"))
).add_selection(selection)
overview = overview.properties(
    height=250,
    width=250,
    title='Interactive chart: Fraud Amounts by Category, Age and Amount'
)
detail = base.mark_rect(tooltip={'content': 'data'}).encode(
    x=alt.X('age', bin=True),
    y=alt.Y('amt', bin=True),
    color='count()'
).transform_filter(selection).properties(height=250, width=250)
overview | detail
Out[24]:
Display Fraud counts by state, amount, transaction-month, categories¶
In [25]:
chart_title = 'Interactive: State vs. Amount - on selection, display month-wise Categories and Amounts for the selected State'
# Let's implement filtering using dynamic queries.
state_selection = alt.selection(type="multi", fields=["state"])
# Overview chart: total fraud amount by state, with state selection
overview = alt.Chart(fraud_expmt_df3).mark_bar(tooltip={'content': 'data'}).encode(
    y="sum(f_amt)",
    x="state",
    color=alt.condition(state_selection, alt.value("tomato"), alt.value("lightgrey"))
).add_selection(state_selection)
overview = overview.properties(
    height=250,
    width=550
)
# Detail chart: month-wise fraud amounts by category for the selected state
detail = alt.Chart(fraud_expmt_df3).mark_circle(tooltip={'content': 'data'}).encode(
    x="tran_month_nm",
    y="f_amt",
    color=alt.Color('category', scale=alt.Scale(scheme='goldorange')),
).properties(
    height=200,
    width=200
).add_selection(state_selection).transform_filter(
    state_selection
)
chart = overview | detail
# Apply title to the entire chart
chart = chart.properties(
    title=chart_title
)
chart
Out[25]:
Geospatial view of state-wise Fraud Counts¶
In [26]:
# Let's center the map using the lat/long for the State of Louisiana (from a quick search); set zoom level to 4
latitude = 30.51
longitude = -91.52
gp_states_df2.explore(
    column="fraud amount",          # choropleth colored by the "fraud amount" column
    #tooltip="toolt",               # show the "toolt" value in a tooltip (on hover)
    popup=True,                     # show all values in a popup (on click)
    tiles="CartoDB positron",       # use "CartoDB positron" tiles
    cmap="Set3",                    # use the "Set3" matplotlib colormap
    style_kwds=dict(color="navy"),  # use a navy outline
    legend_kwds={"label": "Amounts", "orientation": "horizontal"},
    zoom_start=4,
    location=[latitude, longitude]
)
Out[26]:
Merchant Analysis¶
Display the Interactive chart for Top 25 Merchants¶
In [27]:
# Let's implement filtering using dynamic queries.
merchant_selection = alt.selection(type="multi", fields=["merchant"])
# Overview chart: total fraud amount for the Top 25 merchants, with merchant selection
overview = alt.Chart(merch_df2).mark_bar(tooltip={'content': 'data'}).encode(
    y="sum(f_amt)",
    x="merchant",
    color=alt.condition(merchant_selection, alt.value("tomato"), alt.value("lightgrey"))
).add_selection(merchant_selection)
overview = overview.properties(
    height=250,
    width=450,
    title='Interactive chart: Fraud Amounts for Top 25 merchants'
)
# Detail chart: amounts by state for the selected merchants
detail = alt.Chart(merch_df2).mark_circle(tooltip={'content': 'data'}).encode(
    x="amt",
    y="state",
    color="state",
    size='f_amt',
).properties(
    height=450,
    width=400
).add_selection(merchant_selection).transform_filter(
    merchant_selection
)
chart = overview | detail
chart
Out[27]:
Appendix¶
References¶